We wanted to build on the previous hypothesis that distance can be used to predict the fare of a route by adding the average number of passengers who fly the route per day. Our expectation is that more popular routes will be cheaper than those with low flight traffic. We also felt that the best third explanatory term was the interaction between distance and passengers; the other variables in the dataset did not fit as naturally with the two we had already included. Given the differences in scale among the variables and the outcome, we chose to use the log of each value.
\[ \underbrace{Y_i}_\text{lfare} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{base lfare}}} + \overbrace{\beta_1}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{ldist} + \overbrace{\beta_2}^{\text{slope}} \underbrace{X_{2i}}_\text{lpassen} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{ldist:lpassen} + \overbrace{\beta_4}^{\text{slope}} \underbrace{X_{4i}}_\text{LargeShare} + \overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{5i}}_\text{y00} + \overbrace{\beta_6}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{6i}}_\text{y97} +\epsilon_i \]
Below are the multiple regression results using distance, passengers, and a few additional variables.

```r
lm.mult <- lm(lfare ~ ldist + lpassen + ldist:lpassen + LargeShare + y00 + y97, data = IO_airfare)
summary(lm.mult) %>%
  pander(caption = "HW 4 Multiple regression results w/ extra estimators")
```
| &nbsp; | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 7.596 | 0.3353 | 22.65 | 8.764e-108 |
| ldist | -0.3199 | 0.05027 | -6.363 | 2.168e-10 |
| lpassen | -0.8653 | 0.0536 | -16.15 | 4.459e-57 |
| LargeShare | 0.0643 | 0.01249 | 5.149 | 2.729e-07 |
| y00 | 0.0662 | 0.01184 | 5.59 | 2.406e-08 |
| y97 | -0.02985 | 0.01184 | -2.522 | 0.01171 |
| ldist:lpassen | 0.1199 | 0.008095 | 14.81 | 1.618e-48 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 0.3275 | 0.4376 | 0.4369 |
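Since the outcome is log fare, the year-dummy coefficients translate to approximate percent effects on fare via \(e^{\beta}-1\). A quick sketch of that conversion, using the y00 and y97 estimates from the table above:

```r
# Percent effect of a dummy in a log-linear model: exp(beta) - 1.
# Coefficients taken from the regression table above.
b_y00 <- 0.0662
b_y97 <- -0.02985
pct_y00 <- exp(b_y00) - 1   # fares in year 2000 vs. the base year
pct_y97 <- exp(b_y97) - 1   # fares in 1997 vs. the base year
round(100 * pct_y00, 2)  # about +6.84%
round(100 * pct_y97, 2)  # about -2.94%
```

So, holding the other terms fixed, fares in 2000 run roughly 7% higher and fares in 1997 roughly 3% lower than the omitted base year.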
Below are the multiple regression results using distance and passengers, without the extra variables.

```r
lm.mult2 <- lm(lfare ~ ldist + lpassen + ldist:lpassen, data = IO_airfare)
summary(lm.mult2) %>%
  pander(caption = "HW 4 Simple multiple regression w/o extra estimators")
```
| &nbsp; | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 8.08 | 0.3288 | 24.57 | 2.306e-125 |
| ldist | -0.3855 | 0.0496 | -7.771 | 9.51e-15 |
| lpassen | -0.9209 | 0.05336 | -17.26 | 1.027e-64 |
| ldist:lpassen | 0.1277 | 0.00808 | 15.81 | 7.441e-55 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 0.3301 | 0.4282 | 0.4278 |
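As a sanity check on the reported fit, the adjusted \(R^2\) can be recomputed from \(n\), the number of regressors, and \(R^2\) taken from the table (a small base-R sketch):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1), values from the table.
n  <- 4595
k  <- 3          # three regressors: ldist, lpassen, and their interaction
r2 <- 0.4282
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)  # ~0.4278, matching the reported value
```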
```r
confint(lm.mult2, level = 0.95) %>%
  pander(caption = "HW 4 Estimators 95% confidence intervals")
```
| &nbsp; | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 7.435 | 8.724 |
| ldist | -0.4827 | -0.2882 |
| lpassen | -1.025 | -0.8163 |
| ldist:lpassen | 0.1119 | 0.1436 |
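These intervals can also be reconstructed by hand from the estimate and standard error. A sketch for ldist, using the values in the lm.mult2 table (residual df = 4595 - 4, for four estimated coefficients):

```r
# 95% CI by hand: estimate +/- t_crit * SE (values from the lm.mult2 table).
est <- -0.3855
se  <- 0.0496
t_crit <- qt(0.975, df = 4595 - 4)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 4)  # close to the (-0.4827, -0.2882) reported above
```

The tiny difference in the upper bound comes from rounding the tabled estimate.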
Here are the results from the HW 2 regression: predicting fare using distance alone.

```r
lm.sim <- lm(fare ~ dist, data = IO_airfare)
summary(lm.sim) %>%
  pander(caption = "HW 2 simple regression results")
```
| &nbsp; | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 103.3 | 1.643 | 62.87 | 0 |
| dist | 0.07631 | 0.001412 | 54.05 | 0 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 58.55 | 0.3888 | 0.3886 |
Here is the base equation for the regression without the extra variables.
\[ \underbrace{Y_i}_\text{lfare} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{base lfare}}} + \overbrace{\beta_1}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{ldist} + \overbrace{\beta_2}^{\text{slope}} \underbrace{X_{2i}}_\text{lpassen} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{ldist:lpassen} + \epsilon_i \]
Here is the same equation with the estimated coefficients from the table filled in.
\[ \underbrace{Y_i}_\text{lfare} \underbrace{=}_{\sim} \overbrace{8.08}^{\stackrel{\text{y-int}}{\text{base lfare}}} + \overbrace{-0.3855}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{ldist} + \overbrace{-0.9209}^{\text{slope}} \underbrace{X_{2i}}_\text{lpassen} + \overbrace{0.1277}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{ldist:lpassen} + \epsilon_i \]
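Because of the interaction term, the distance elasticity of fare is not a single number: it is \(-0.3855 + 0.1277 \cdot \text{lpassen}\). A small sketch (the lpassen values plugged in below are purely illustrative, not dataset statistics):

```r
# Marginal effect of ldist on lfare, using the lm.mult2 coefficients.
# d(lfare)/d(ldist) = -0.3855 + 0.1277 * lpassen
dist_elasticity <- function(lpassen) -0.3855 + 0.1277 * lpassen

dist_elasticity(2)  # negative on low-traffic routes: -0.1301
dist_elasticity(6)  # positive on high-traffic routes
# The sign flips near lpassen = 0.3855 / 0.1277, i.e. about 3.02.
```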
```r
# b <- coef(lm.mult)
## Hint: library(car) has a scatter3d function which is simple to use,
## but the code should only be run in your console, not knit.
library(car)
# scatter3d(fare ~ dist + passen, data = IO_airfare)
```
```r
## Embedding the 3d scatterplot inside the html document is harder.
library(reshape2)  # acast()
library(plotly)    # plot_ly()

# Graph resolution (more important for more complex shapes)
graph_reso <- 0.5

# Set up axes
axis_x <- seq(min(IO_airfare$ldist), max(IO_airfare$ldist), by = graph_reso)
axis_y <- seq(min(IO_airfare$lpassen), max(IO_airfare$lpassen), by = graph_reso)

# Evaluate the fitted surface on the grid
lmnew <- expand.grid(ldist = axis_x, lpassen = axis_y, KEEP.OUT.ATTRS = FALSE)
lmnew$Z <- predict.lm(lm.mult2, newdata = lmnew)
lmnew <- acast(lmnew, lpassen ~ ldist, value.var = "Z")  # y ~ x

# Create the scatterplot with the regression surface overlaid
plot_ly(IO_airfare,
        x = ~ldist,
        y = ~lpassen,
        z = ~lfare,
        text = rownames(IO_airfare),
        type = "scatter3d",
        mode = "markers", color = ~lfare) %>%
  add_trace(z = lmnew,
            x = axis_x,
            y = axis_y,
            type = "surface")
```
Based on the multiple regression, the intercept of 8.08 is the baseline log fare (where ldist and lpassen are both zero). Holding the other terms fixed, each 1% increase in distance is associated with roughly a 0.39% decrease in fare, and each 1% increase in average daily passengers with roughly a 0.92% decrease. The interaction coefficient of 0.1277 is small but highly significant: the distance elasticity becomes less negative on higher-traffic routes. The p-values for all of these terms are extremely close to 0.
These relationships are best seen in the 3d plot. It is quickly apparent that the estimators have similarly weighted effects on the predicted values, as the points spread evenly through the central area of the chart.
Assuming that our sample is random, the following diagnostic plots help examine the residuals. The first (residuals vs. fitted) primarily shows whether variance remains constant across fitted values. The second (the Q-Q plot) shows minor signs of right skewness; we suspect this is because the regression fails to capture the base costs of a flight, leaving those to \(\beta_0\). The third plot checks whether the order of the data matters. This is usually needed for time-sorted data, but since this set is sorted alphabetically by origin, we included it to see if any patterns presented themselves.
Compared with the non-log version, the main change is that the extremes now show significantly less variance, so confidence in predictions at the extremes may be greater.
```r
par(mfrow = c(1, 3))
plot(lm.mult2, which = 1:2)
plot(lm.mult2$residuals)
```
```r
library(readxl)  # read_excel()
library(dplyr)   # filter()

X401ksubs <- read_excel("401ksubs.xls")
X401ksubsF <- X401ksubs %>% filter(fsize == 1)
X401ksubsF$fsize <- as.factor(X401ksubsF$fsize)
```
There are 2017 single person households in the data set.
This tells us that for every $1,000 increase in income, net financial wealth increases by about $953, and for every additional year of age it increases by about $1,031 (both inc and nettfa are measured in $1,000s).
The intercept implies that people are roughly $60,700 in debt from the moment they are born, since at the moment of birth both income and age are 0.
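To make the intercept and slopes concrete, here is a hand prediction for a hypothetical household (the income of $50,000 and age of 40 are made-up values, not from the data):

```r
# Predicted nettfa, in $1,000s, using the coefficients from the summary below.
b0    <- -60.69654
b_inc <-   0.95336
b_age <-   1.03078
nettfa_hat <- b0 + b_inc * 50 + b_age * 40  # inc is also in $1,000s
round(nettfa_hat, 1)  # ~28.2, i.e. about $28,200 predicted net financial wealth
```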
```r
lm.mult3 <- lm(nettfa ~ inc + age, data = X401ksubs)
summary(lm.mult3)
##
## Call:
## lm(formula = nettfa ~ inc + age, data = X401ksubs)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -509.27  -18.71   -4.09   10.02 1464.74
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60.69654    2.59633  -23.38   <2e-16 ***
## inc           0.95336    0.02528   37.72   <2e-16 ***
## age           1.03078    0.05912   17.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.31 on 9272 degrees of freedom
## Multiple R-squared:  0.1691, Adjusted R-squared:  0.1689
## F-statistic: 943.2 on 2 and 9272 DF,  p-value: < 2.2e-16
```
When we conducted a t-test for each of the variables, we got large t values and p-values below 0.01 in every case, so we can reject the null hypotheses that our betas are equal to zero.
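Each printed t value is simply the estimate divided by its standard error; a quick check using the numbers from the summary above:

```r
# t = estimate / std. error, using the printed (rounded) summary values.
t_inc <- 0.95336 / 0.02528
t_age <- 1.03078 / 0.05912
round(c(t_inc, t_age), 2)
# ~37.71 and ~17.44; the summary prints 37.72 and 17.43 because it
# divides the unrounded estimates, but the conclusion is identical.
```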
```r
lm.mult3 <- lm(nettfa ~ inc, data = X401ksubs)
summary(lm.mult3)
##
## Call:
## lm(formula = nettfa ~ inc, data = X401ksubs)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -504.39  -18.10   -4.29    6.73 1475.04
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.17948    1.17643  -17.15   <2e-16 ***
## inc           0.99991    0.02554   39.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.26 on 9273 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1417
## F-statistic:  1532 on 1 and 9273 DF,  p-value: < 2.2e-16
```
Yes, dropping age increases the coefficient on income to 0.999. With age omitted there is no longer a term shifting the regression line by age; we can no longer separate the predictions by age and must find the single slope that best fits the data as a whole from the same base level of debt.
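This is the classic omitted-variable pattern: the short regression's income coefficient equals the long regression's, plus the age coefficient times the slope from regressing age on income. Solving for that slope using the estimates above (a back-of-the-envelope sketch, not a rerun of the regression):

```r
# Omitted-variable relation: short_b_inc = long_b_inc + b_age * delta,
# where delta is the slope from regressing age on inc.
long_b_inc  <- 0.95336
b_age       <- 1.03078
short_b_inc <- 0.99991
delta <- (short_b_inc - long_b_inc) / b_age
round(delta, 3)  # ~0.045: consistent with age and income being positively related
```

The positive implied delta is why the income coefficient rises: the short regression credits income with part of age's effect.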